Improved Input Data Splitting in MapReduce
Abstract
The performance of MapReduce depends heavily on how input data is split before the map phase. Splitting is usually done with naive methods that are far from optimal. This paper describes Improved Input Splitting, a locality-based technique that addresses the input data splitting problems that seriously affect job performance. Improved Input Splitting clusters data blocks from the same node into a single partition, so that they are processed by one map task. This avoids the time spent on slot reallocation and on initializing multiple tasks. Experimental results show that it substantially improves MapReduce processing performance compared with the traditional Hadoop implementation.
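The grouping idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `group_blocks_by_node`, the input layout (a mapping from block ID to the hosts holding a replica), and the tie-breaking rule for blocks with several replicas are all assumptions made for illustration.

```python
from collections import defaultdict


def group_blocks_by_node(block_locations):
    """Group blocks into one split per node (hypothetical helper).

    block_locations maps a block ID to a non-empty list of hosts that
    hold a replica of that block. Returns a dict mapping each chosen
    host to the list of block IDs its single map task would process.
    """
    splits = defaultdict(list)
    for block_id in sorted(block_locations):
        hosts = block_locations[block_id]
        # Assumption: among the replica hosts, pick the one with the
        # fewest blocks so far, breaking ties by host name.
        target = min(hosts, key=lambda h: (len(splits[h]), h))
        splits[target].append(block_id)
    # Looking up splits[h] above creates empty entries; drop them.
    return {host: blocks for host, blocks in splits.items() if blocks}


if __name__ == "__main__":
    locations = {
        "blk_1": ["node-a"],
        "blk_2": ["node-a"],
        "blk_3": ["node-b"],
    }
    for host, blocks in sorted(group_blocks_by_node(locations).items()):
        print(host, blocks)
```

Each resulting entry corresponds to one partition that a single map task could read locally, in contrast to one task per block.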
Similar resources
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The rack-aware data placement strategy of the current Hadoop distributed file system assumes a homogeneous Hadoop cluster, in which each node has the same computing capacity and is assigned the same workload. Default Hadoop d...
PDTSSE: A Scalable Parallel Decision Tree Algorithm Based on MapReduce
Parallel decision tree learning is an effective and efficient approach to scaling decision trees to large data mining applications. Aiming at large-scale decision tree learning, we present a novel parallel decision tree learning algorithm in the MapReduce framework, called PDTSSE (Parallel Decision Tree via Sampling Splitting points with Estimation). We first propose an estimation method for samp...
Can one find External Source Input Expressions for which there exist Map Reduce Configurations?
One aim of MapReduce Sets for External Source Input expression analysis is to suggest criteria for how External Source Input expressions in External Source Input data can be defined in a meaningful way and how they should be compared. Similitude-based MapReduce Sets for External Source Input Expression Analysis and MapReduce Sets for Assignment are expected to adhere to fundamental principles...
An Improved K-means Algorithm based on Mapreduce and Grid
With the traditional K-means clustering algorithm, it is difficult to choose the number of clusters K, and the initial cluster centers are selected randomly, which makes the clustering results very unstable. The algorithm is also susceptible to noise points. To solve these problems, the traditional K-means algorithm is improved. The improved method is divided into the same grid in space, according ...
MapReduce with Deltas
The MapReduce programming model is extended conservatively to deal with deltas for input data such that recurrent MapReduce computations can be more efficient for the case of input data that changes only slightly over time. That is, the extended model enables more frequent re-execution of MapReduce computations and thereby more up-to-date results in practical applications. Deltas can also be pu...